Data Extraction


TaskEval: Synthesised Evaluation for Foundation-Model Tasks

Widanapathiranage, Dilani, Barnett, Scott, Kurniawan, Stefanus, Takerngsaksiri, Wannita

arXiv.org Artificial Intelligence

Hallucinations are a key concern when creating applications that rely on Foundation models (FMs). Understanding where and how these subtle failures occur in an application relies on evaluation methods known as evals. Prior work focuses on defining new eval methods or benchmark datasets for specific tasks. However, neither helps a software team with a task-specific FM application when no metric or dataset exists. The demand for both automated approaches and deep integration of human insight makes this a challenging problem. We address this gap by proposing an approach to synthesise an FM task-specific evaluator program that provides automation and a custom UI for capturing feedback. The core novelty of our approach lies in: (1) a task-agnostic meta-model that captures properties of any FM task, (2) an interaction protocol for efficient use of human feedback, and (3) an eval synthesiser that selects or generates an appropriate set of evals. We implement our approach in TaskEval and demonstrate the concept on two diverse FM tasks: chart data extraction and document question answering. A preliminary evaluation of the quality of our selected evals shows 93% and 90% accuracy respectively. Our research tackles a growing problem facing engineering teams: how to evaluate and review outputs from FM tasks.
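
The abstract does not publish the meta-model schema, so the following is a minimal, hypothetical Python sketch of what a task-agnostic FM-task meta-model feeding an eval synthesiser might look like; every field, function, and eval name here is illustrative, not TaskEval's actual design.

```python
# Hypothetical sketch of a task-agnostic meta-model for FM tasks
# (field names are assumptions; the paper's actual schema is not public).
from dataclasses import dataclass, field

@dataclass
class FMTask:
    name: str                      # e.g. "chart data extraction"
    input_modality: str            # "image", "text", "pdf", ...
    output_schema: dict            # expected structure of the FM output
    quality_properties: list[str] = field(default_factory=list)

# A toy eval synthesiser: map declared quality properties to library checks.
EVAL_LIBRARY = {
    "well_formed_output": lambda out: isinstance(out, dict),
    "non_empty": lambda out: bool(out),
}

def synthesise_evals(task: FMTask):
    """Select library evals matching the task's declared properties."""
    return {p: EVAL_LIBRARY[p] for p in task.quality_properties if p in EVAL_LIBRARY}

task = FMTask("chart data extraction", "image", {"series": list},
              quality_properties=["well_formed_output", "non_empty"])
evals = synthesise_evals(task)
print({name: check({"series": [1, 2]}) for name, check in evals.items()})
```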


ARETE: an R package for Automated REtrieval from TExt with large language models

Branco, Vasco V., Benedek, Jandó, Pivovarova, Lidia, Correia, Luís, Cardoso, Pedro

arXiv.org Artificial Intelligence

1. A hard stop for the implementation of rigorous conservation initiatives is our lack of key species data, especially occurrence data. Researchers must also contend with the accelerated speed at which new information must be collected and processed due to anthropogenic activity. Publications ranging from scientific papers to gray literature contain this crucial information, but their data are often not machine-readable, requiring extensive human work to retrieve.
2. We present the ARETE R package, open-source software that automates the extraction of species occurrence data using large language models, namely via the ChatGPT Application Programming Interface. The package integrates all steps of the data extraction and validation process, from Optical Character Recognition to outlier detection and output in tabular format. We validate ARETE through systematic comparison between model output and the work of human annotators.
3. We demonstrate the usefulness of the approach by comparing range maps produced using GBIF data with those automatically extracted for 100 species of spiders. Newly extracted data expanded the known Extent of Occurrence by a mean of three orders of magnitude, revealing new areas where the species were found in the past, which may have important implications for spatial conservation planning and extinction risk assessments.
4. ARETE allows faster access to hitherto untapped occurrence data, a potential game changer in projects requiring such data. Researchers will be able to better prioritize resources, manually verifying selected species while maintaining automated extraction for the majority. This workflow also allows available bibliographic data to be predicted during project planning.
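
As an illustration of the general workflow (ARETE itself is an R package; this sketch is Python, and call_llm is a placeholder for a real chat-model API such as the ChatGPT API the package wraps):

```python
# Illustrative sketch of OCR-text -> LLM -> validated tabular records;
# not ARETE's actual code, and call_llm is a stub for any chat API.
import json

def call_llm(prompt: str) -> str:
    """Placeholder: swap in a real API call (e.g. OpenAI's chat endpoint)."""
    return '[{"species": "Araneus diadematus", "lat": 38.72, "lon": -9.14}]'

def extract_occurrences(ocr_text: str) -> list[dict]:
    prompt = (
        "Extract every species occurrence record from the text below as a "
        "JSON list of {species, lat, lon} objects. Text:\n" + ocr_text
    )
    records = json.loads(call_llm(prompt))
    # Coordinate sanity check, analogous in spirit to ARETE's outlier step:
    return [r for r in records if -90 <= r["lat"] <= 90 and -180 <= r["lon"] <= 180]

print(extract_occurrences("Araneus diadematus was recorded near Lisbon (38.72N, 9.14W)."))
```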


SciDaSynth: Interactive Structured Data Extraction from Scientific Literature with Large Language Model

Wang, Xingbo, Huey, Samantha L., Sheng, Rui, Mehta, Saurabh, Wang, Fei

arXiv.org Artificial Intelligence

The explosion of scientific literature has made the efficient and accurate extraction of structured data a critical component for advancing scientific knowledge and supporting evidence-based decision-making. However, existing tools often struggle to extract and structure multimodal, varied, and inconsistent information across documents into standardized formats. We introduce SciDaSynth, a novel interactive system powered by large language models (LLMs) that automatically generates structured data tables according to users' queries by integrating information from diverse sources, including text, tables, and figures. Furthermore, SciDaSynth supports efficient table data validation and refinement, featuring multi-faceted visual summaries and semantic grouping capabilities to resolve cross-document data inconsistencies. A within-subjects study with nutrition and NLP researchers demonstrates SciDaSynth's effectiveness in producing high-quality structured data more efficiently than baseline methods. We discuss design implications for human-AI collaborative systems supporting data extraction tasks. The system code is available at https://github.com/xingbow/SciDaEx
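
Cross-document inconsistency resolution can be pictured with a toy grouping routine; SciDaSynth uses LLM/embedding-based semantic grouping, and the string-similarity stand-in below is purely illustrative:

```python
# Toy value grouping across documents; difflib is a crude stand-in for
# the semantic grouping the system actually performs.
from difflib import SequenceMatcher

def group_values(values: list[str], threshold: float = 0.6) -> list[list[str]]:
    groups: list[list[str]] = []
    for v in values:
        for g in groups:
            if SequenceMatcher(None, v.lower(), g[0].lower()).ratio() >= threshold:
                g.append(v)   # close enough to the group's representative
                break
        else:
            groups.append([v])  # start a new group
    return groups

# Inconsistent spellings of the same field value collapse together:
print(group_values(["vitamin B12", "Vitamin B-12", "iron", "Iron (Fe)"]))
```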


ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature

Roy, Aritra, Grisan, Enrico, Buckeridge, John, Gattinoni, Chiara

arXiv.org Artificial Intelligence

Since the advent of various pre-trained large language models, extracting structured knowledge from scientific text has experienced a revolutionary change compared with traditional machine learning or natural language processing techniques. Despite these advances, accessible automated tools that allow users to construct, validate, and visualise datasets from scientific literature extraction remain scarce. We therefore developed ComProScanner, an autonomous multi-agent platform that facilitates the extraction, validation, classification, and visualisation of machine-readable chemical compositions and properties, integrated with synthesis data from journal articles for comprehensive database creation. We evaluated our framework on 100 journal articles using 10 different LLMs, including both open-source and proprietary models, to extract highly complex compositions associated with ceramic piezoelectric materials and their corresponding piezoelectric strain coefficients (d33), motivated by the lack of a large dataset for such materials. DeepSeek-V3-0324 outperformed all other models with an overall accuracy of 0.82. This framework provides a simple, user-friendly, readily usable package for extracting highly complex experimental data buried in the literature to build machine learning or deep learning datasets.
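
A single-prompt, schema-constrained sketch of the kind of composition-property extraction described (the real system is multi-agent; the schema fields below are assumptions, not ComProScanner's actual schema):

```python
# Hedged sketch of schema-constrained extraction of composition/d33 pairs.
import json

SCHEMA = {"composition": "chemical formula", "d33_pC_per_N": "number",
          "synthesis_route": "string or null"}

def build_prompt(article_text: str) -> str:
    return (
        "From the article below, list every piezoelectric composition with its "
        f"d33 value as JSON objects matching this schema: {json.dumps(SCHEMA)}.\n\n"
        + article_text
    )

print(build_prompt("The ceramic 0.96(K0.48Na0.52)NbO3-0.04BaZrO3 showed d33 = 416 pC/N."))
```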


PoseGaze-AHP: A Knowledge-Based 3D Dataset for AI-Driven Ocular and Postural Diagnosis

Al-Dabet, Saja, Turaev, Sherzod, Zaki, Nazar, Khan, Arif O., Eldweik, Luai

arXiv.org Artificial Intelligence

Diagnosing ocular-induced abnormal head posture (AHP) requires a comprehensive analysis of both head pose and ocular movements. However, existing datasets focus on these aspects separately, limiting the development of integrated diagnostic approaches and restricting AI-driven advancements in AHP analysis. To address this gap, we introduce PoseGaze-AHP, a novel 3D dataset that synchronously captures head pose and gaze movement information for ocular-induced AHP assessment. Structured clinical data were extracted from medical literature using large language models (LLMs) through an iterative process with the Claude 3.5 Sonnet model, combining stepwise, hierarchical, and complex prompting strategies. The extracted records were systematically imputed and transformed into 3D representations using the Neural Head Avatar (NHA) framework. The dataset includes 7,920 images generated from two head textures, covering a broad spectrum of ocular conditions. The extraction method achieved an overall accuracy of 91.92%, demonstrating its reliability for clinical dataset construction. PoseGaze-AHP is the first publicly available resource tailored for AI-driven ocular-induced AHP diagnosis, supporting the development of accurate and privacy-compliant diagnostic tools.
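
A hypothetical sketch of the stepwise-then-hierarchical prompting loop described above (ask_model is a stub for the Claude API; the field names are illustrative, not the paper's schema):

```python
# Illustrative two-stage prompting: coarse fields first, then follow-ups
# conditioned on earlier answers. Not the authors' actual pipeline.
def ask_model(prompt: str) -> str:
    return "left head tilt"  # stub response for demonstration

def extract_record(case_report: str) -> dict:
    record = {}
    # Step 1: stepwise extraction of top-level fields.
    for field in ("head_posture", "gaze_direction", "ocular_condition"):
        record[field] = ask_model(f"From this case report, state the patient's "
                                  f"{field.replace('_', ' ')}:\n{case_report}")
    # Step 2: hierarchical follow-up conditioned on a step-1 answer.
    record["posture_angle_deg"] = ask_model(
        f"The posture was '{record['head_posture']}'. Estimate the tilt angle in degrees."
    )
    return record

print(extract_record("A 7-year-old with superior oblique palsy held a left head tilt."))
```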


Automatic Building Code Review: A Case Study

Wan, Hanlong, Xu, Weili, Rosenberg, Michael, Zhang, Jian, Siddika, Aysha

arXiv.org Artificial Intelligence

Building officials, especially those in resource-constrained or rural jurisdictions, struggle with labor-intensive, error-prone, and costly manual reviews of design documents as projects scale in size and complexity. Widespread adoption of Building Information Modeling (BIM) and Large Language Models (LLMs) has created opportunities for automated code review (ACR) solutions. This study proposes a novel agent-driven framework that integrates BIM-based data extraction with automated verification using both retrieval-augmented generation (RAG) and Model Context Protocol (MCP) agent pipelines. The framework employs LLM-enabled agents to extract geometry, schedules, and system attributes from heterogeneous file types, which are then processed for building code checking via two complementary mechanisms: (i) direct API calls to DOE's COMcheck engine, providing deterministic and audit-ready outputs, and (ii) RAG-based reasoning over rule provisions, allowing flexible interpretation where coverage is incomplete or ambiguous. The framework was evaluated through case demonstrations, including automated extraction of geometric attributes (e.g., surface area, tilt, and insulation values), parsing of operational schedules, and design validation for lighting allowances under ASHRAE Standard 90.1-2022. Comparative performance tests across multiple large language models showed that Generative Pre-trained Transformer 4 Omni (GPT-4o) achieved the best balance of efficiency and stability, while smaller models exhibited inconsistencies or failures. Results confirm that MCP agent pipelines outperform RAG reasoning pipelines in rigor and flexibility in these workflows.
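
The two complementary checking mechanisms can be pictured as a small router; both backends below are stubs (no real COMcheck API surface is reproduced), and the check names are made up for illustration:

```python
# Toy router over the two mechanisms described above: deterministic engine
# calls where coverage exists, RAG-style reasoning where it does not.
def comcheck_api(attributes: dict) -> str:
    return "PASS"  # placeholder for a deterministic engine call

def rag_reasoning(attributes: dict, provision: str) -> str:
    return "Likely compliant; provision wording is ambiguous."  # placeholder LLM step

DETERMINISTIC_CHECKS = {"lighting_allowance", "envelope_insulation"}

def review(check: str, attributes: dict, provision: str = "") -> str:
    """Prefer the audit-ready engine; fall back to RAG where coverage ends."""
    if check in DETERMINISTIC_CHECKS:
        return comcheck_api(attributes)
    return rag_reasoning(attributes, provision)

print(review("lighting_allowance", {"lpd_w_per_ft2": 0.82}))
print(review("daylighting_controls", {}, provision="ASHRAE 90.1-2022 §9.4"))
```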


Benchmarking Agentic Systems in Automated Scientific Information Extraction with ChemX

Vepreva, Anastasia, Razlivina, Julia, Eremeeva, Maria, Gubina, Nina, Orlova, Anastasia, Dmitrenko, Aleksei, Kapranova, Ksenya, Jyakhwo, Susan, Vasilev, Nikita, Sarkisyan, Arsen, Chernyshov, Ivan Yu., Vinogradov, Vladimir, Dmitrenko, Andrei

arXiv.org Artificial Intelligence

The emergence of agent-based systems represents a significant advancement in artificial intelligence, with growing applications in automated data extraction. However, chemical information extraction remains a formidable challenge due to the inherent heterogeneity of chemical data. Current agent-based approaches, both general-purpose and domain-specific, exhibit limited performance in this domain. To address this gap, we present ChemX, a comprehensive collection of 10 manually curated and domain-expert-validated datasets focusing on nanomaterials and small molecules. These datasets are designed to rigorously evaluate and enhance automated extraction methodologies in chemistry. To demonstrate their utility, we conduct an extensive benchmarking study comparing existing state-of-the-art agentic systems such as ChatGPT Agent and chemical-specific data extraction agents. Additionally, we introduce our own single-agent approach that enables precise control over document preprocessing prior to extraction. We further evaluate the performance of modern baselines, such as GPT-5 and GPT-5 Thinking, to compare their capabilities with agentic approaches. Our empirical findings reveal persistent challenges in chemical information extraction, particularly in processing domain-specific terminology, complex tabular and schematic representations, and context-dependent ambiguities. The ChemX benchmark serves as a critical resource for advancing automated information extraction in chemistry, challenging the generalization capabilities of existing methods, and providing valuable insights into effective evaluation strategies.
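
A benchmark of this kind ultimately reduces to comparing extracted fields against expert-validated gold records; a minimal, assumption-laden scoring loop might look like the following (the metric and field names are illustrative, not ChemX's actual protocol):

```python
# Minimal field-level accuracy over paired predicted/gold records.
def field_accuracy(predictions: list[dict], gold: list[dict]) -> float:
    hits = total = 0
    for pred, ref in zip(predictions, gold):
        for key, value in ref.items():
            total += 1
            hits += pred.get(key) == value  # exact-match on each gold field
    return hits / total if total else 0.0

gold = [{"name": "TiO2", "size_nm": 25}]
pred = [{"name": "TiO2", "size_nm": 30}]
print(f"field accuracy: {field_accuracy(pred, gold):.2f}")  # 0.50
```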


The Prompt Engineering Report Distilled: Quick Start Guide for Life Sciences

Romanov, Valentin, Niederer, Steven A

arXiv.org Artificial Intelligence

Developing effective prompts demands significant cognitive investment to generate reliable, high-quality responses from Large Language Models (LLMs). By deploying case-specific prompt engineering techniques that streamline frequently performed life sciences workflows, researchers could achieve substantial efficiency gains that far exceed the initial time investment required to master these techniques. The Prompt Report published in 2025 outlined 58 different text-based prompt engineering techniques, highlighting the numerous ways prompts can be constructed. To provide actionable guidelines and reduce the friction of navigating these various approaches, we distil this report into 6 core techniques: zero-shot and few-shot approaches, thought generation, ensembling, self-criticism, and decomposition. We break down the significance of each approach and ground it in use cases relevant to life sciences, from literature summarization and data extraction to editorial tasks. We provide detailed recommendations for how prompts should and should not be structured, addressing common pitfalls including multi-turn conversation degradation, hallucinations, and distinctions between reasoning and non-reasoning models. We examine context window limitations and agentic tools such as Claude Code, analyze the effectiveness of Deep Research tools across the OpenAI, Google, Anthropic, and Perplexity platforms, and discuss current limitations. We demonstrate how prompt engineering can augment rather than replace established individual practices around data processing and document editing. Our aim is to provide actionable guidance on core prompt engineering principles and to facilitate the transition from opportunistic prompting to an effective, low-friction, systematic practice that contributes to higher-quality research.
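
For instance, two of the six techniques reduce to reusable prompt templates; the wording below is illustrative, not taken from The Prompt Report:

```python
# Few-shot prompting: show worked examples before the real input.
FEW_SHOT = """Extract (gene, organism) pairs.
Text: "BRCA1 variants in humans..." -> [("BRCA1", "human")]
Text: "sonic hedgehog in zebrafish..." -> [("shh", "zebrafish")]
Text: "{passage}" ->"""

# Self-criticism: a second pass asks the model to audit its own draft.
SELF_CRITICISM = """Here is your previous answer: {draft}
Check it against the source passage for unsupported claims and
return a corrected version only."""

print(FEW_SHOT.format(passage="p53 expression in mice was elevated."))
```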


zERExtractor: An Automated Platform for Enzyme-Catalyzed Reaction Data Extraction from Scientific Literature

Zhou, Rui, Ma, Haohui, Xin, Tianle, Zou, Lixin, Hu, Qiuyue, Cheng, Hongxi, Lin, Mingzhi, Guo, Jingjing, Wang, Sheng, Zhang, Guoqing, Wei, Yanjie, Zheng, Liangzhen

arXiv.org Artificial Intelligence

The rapid expansion of enzyme kinetics literature has outpaced the curation capabilities of major biochemical databases, creating a substantial barrier to AI-driven modeling and knowledge discovery. We present zERExtractor, an automated and extensible platform for comprehensive extraction of enzyme-catalyzed reaction and activity data from scientific literature. zERExtractor features a unified, modular architecture that supports plug-and-play integration of state-of-the-art models, including large language models (LLMs), as interchangeable components, enabling continuous system evolution alongside advances in AI. Our pipeline combines domain-adapted deep learning, advanced OCR, semantic entity recognition, and prompt-driven LLM modules, together with human expert corrections, to extract kinetic parameters (e.g., kcat, Km), enzyme sequences, substrate SMILES, experimental conditions, and molecular diagrams from heterogeneous document formats. Through active learning strategies integrating AI-assisted annotation, expert validation, and iterative refinement, the system adapts rapidly to new data sources. We also release a large benchmark dataset comprising over 1,000 annotated tables and 5,000 biological fields from 270 P450-related enzymology publications. Benchmarking demonstrates that zERExtractor consistently outperforms existing baselines in table recognition (accuracy 89.9%), molecular image interpretation (up to 99.1%), and relation extraction (accuracy 94.2%). zERExtractor bridges the longstanding data gap in enzyme kinetics with a flexible, plugin-ready framework and high-fidelity extraction, laying the groundwork for future AI-powered enzyme modeling and biochemical knowledge discovery.
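
The plug-and-play module design can be sketched as a simple stage registry; the stage names and stub outputs below are hypothetical stand-ins for the real OCR and LLM modules:

```python
# Sketch of a plug-and-play pipeline: stages register themselves and run
# in declaration order, so any stage can be swapped for a newer model.
from typing import Callable

PIPELINE: list[Callable[[dict], dict]] = []

def stage(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
    PIPELINE.append(fn)  # register in declaration order
    return fn

@stage
def ocr(doc: dict) -> dict:
    doc["text"] = "kcat = 12 s-1 for CYP3A4"  # stand-in for a real OCR module
    return doc

@stage
def extract_kinetics(doc: dict) -> dict:
    doc["kcat"] = 12.0  # stand-in for the prompt-driven LLM module
    return doc

def run(doc: dict) -> dict:
    for fn in PIPELINE:
        doc = fn(doc)
    return doc

print(run({"pdf": "paper.pdf"}))
```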


What Level of Automation is "Good Enough"? A Benchmark of Large Language Models for Meta-Analysis Data Extraction

Li, Lingbo, Mathrani, Anuradha, Susnjak, Teo

arXiv.org Artificial Intelligence

Automating data extraction from full-text randomised controlled trials (RCTs) for meta-analysis remains a significant challenge. This study evaluates the practical performance of three LLMs (Gemini-2.0-flash, Grok-3, GPT-4o-mini) across tasks involving statistical results, risk-of-bias assessments, and study-level characteristics in three medical domains: hypertension, diabetes, and orthopaedics. We tested four distinct prompting strategies (basic prompting, self-reflective prompting, model ensemble, and customised prompts) to determine how to improve extraction quality. All models demonstrate high precision but consistently suffer from poor recall by omitting key information. We found that customised prompts were the most effective, boosting recall by up to 15%. Based on this analysis, we propose a three-tiered set of guidelines for using LLMs in data extraction, matching data types to appropriate levels of automation based on task complexity and risk. Our study offers practical advice for automating data extraction in real-world meta-analyses, balancing LLM efficiency with expert oversight through targeted, task-specific automation.
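
Of the four prompting strategies tested, self-reflective prompting is the simplest to sketch; call_llm below is a stub for any of the evaluated models, and the prompt wording is illustrative:

```python
# Illustrative self-reflective prompting: a second pass targets the
# recall failures (omitted outcomes) described above.
def call_llm(prompt: str) -> str:
    return "[]"  # placeholder for a real model call

def extract_with_reflection(rct_text: str) -> str:
    first = call_llm("List all reported outcome statistics as JSON:\n" + rct_text)
    # Reflection pass: ask the model to re-read and fill in what it missed.
    return call_llm(
        f"You previously extracted: {first}\n"
        f"Re-read the trial and add any outcomes you omitted:\n{rct_text}"
    )

print(extract_with_reflection("Mean SBP fell 8.2 mmHg (95% CI 5.1-11.3) ..."))
```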